The document reads .ris bibliographic files, filters
selected studies, and categorises data sources into Articles,
Packages, and Kaggle. All entries are classified by
use-case type, data type,
sport, population, and
synthetic-generation potential.
A final evaluation scores all datasets against predefined criteria and compares their suitability for generating synthetic datasets using statistical and/or GAN-based approaches.
To compile, classify, and evaluate publicly available sports datasets based on data source, methodological characteristics, and use-case categories.
Import bibliographic datasets and conduct manual and Shiny-based screening.
Merge Article, Package, and Kaggle datasets into a single structured dataset.
Classify each dataset into five use-case types: Movement, Tactical, Performance, Injury, and Player.
Extract detailed metadata: population, sport domain, geographic region, study design, sample size, and data type.
Apply an eight-criterion scoring system to evaluate dataset quality.
Rank datasets separately for Statistical and GAN-based applications.
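The merge step listed above can be illustrated with a minimal base-R sketch. The three small data frames and their values are hypothetical stand-ins for the Article, Package, and Kaggle sheets; only the `use.case.type` column name is taken from this report.

```r
# Toy stand-ins for the three source sheets (all rows are hypothetical).
articles <- data.frame(dataset = c("A1", "A2"),
                       use.case.type = c("Injury", "Tactical"))
packages <- data.frame(dataset = "P1", use.case.type = "Performance")
kaggle   <- data.frame(dataset = "K1", use.case.type = "Injury")

# Tag each sheet with its source type, then stack them into one table.
sources <- list(Article = articles, Package = packages, Kaggle = kaggle)
merged <- do.call(rbind, Map(function(df, src) cbind(df, source = src),
                             sources, names(sources)))
rownames(merged) <- NULL

# Cross-tabulate source type against use-case type.
counts <- table(merged$source, merged$use.case.type)
counts
```

Keeping the source label as an ordinary column means later summaries by source need no extra joins.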
| Source Type | Number of Datasets |
|---|---|
| Articles | 32 |
| Packages | 12 |
| Kaggle | 6 |

| Category | Number of Datasets | Data Types | Population | Most Frequent Sports | Top 3 (by Score) |
|---|---|---|---|---|---|
| GAN-based | 16 | Video, Image | Athlete | Multiple, Basketball, Fitness | TeamTrack, C-Sports, SportsMOT |
| Statistical | 34 | Tabular, Physiological, Medical Record, Survey, Accelerometer | Athlete, Multiple | Football, Baseball, Basketball, Fitness | MTS-5, NCAA-ISP, LLBD |

| Use Case | Description | Examples |
|---|---|---|
| Movement | Pose estimation, motion tracking, biomechanical analysis | LLBD, HTHARD, RBD, WEAR |
| Tactical | Formation recognition, event detection, game-situation analysis | TeamTrack, C-Sports, SportsMOT, MultiSports |
| Performance | Fatigue prediction, load and performance monitoring | MTS-5, ScopeSense, PMData, Lahman |
| Injury | Impact simulation, risk modelling, unsafe event reconstruction | NCAA-ISP, NEISS, FFTSC-10Y, NHL-ATR |
| Player | Player identity, jersey recognition, technical skill prediction | SportsHHI, NFBDB2026, nflfastR/hoopR/nhlapi |
## [1] "ebscoSport.ris" "ieee.ris" "qut.ris"
## [4] "scienceDirect.ris" "springerNature.ris" "wos.ris"
# The Web of Science file causes an error on import; remove its empty lines
wos <- readLines("data/database/ris/wos.ris")
wos <- wos[wos != ""]
writeLines(wos, "data/database/ris/wos.ris")
# Read all files as a list and convert to a data frame
files <- list.files("data/database/ris", pattern = "\\.ris$", full.names = TRUE)
bibliography <- read_bibliography(filename = files, return_df = TRUE)
bibliography
# Title preparation: lower-case the titles and strip punctuation
bibliography$titleLower <- tolower(bibliography$title)
bibliography$titleLower <- strip(bibliography$titleLower, apostrophe.remove = TRUE)
head(bibliography$titleLower)
## [1] "secondary prevention of musculoskeletal sports injuries a scoping review of early detection and early intervention strategies"
## [2] "the effects of rule changes in footballcode team sports a systematic review"
## [3] "how physical education teachers are positioned in models scholarship a scoping review"
## [4] "physical education from lgbtq students perspective a systematic review of qualitative studies"
## [5] "the altmetric score has a stronger relationship with article citations than journal impact factor and open access status a crosssectional analysis of sport sciences articles"
## [6] "methods of the national collegiate athletic association injury surveillance program â through â"
## [1] "crosssectional and longitudinal associations of active travel organised sport and physical education with accelerometerassessed moderatetovigorous physical activity in young people the international childrenâs accelerometry database"
## [2] "match score dataset for team ball sports"
## [3] "collective sports a multitask dataset for collective activity recognition"
## [4] "tgc reid a dataset for sport event reidentification in the wild"
## [5] "regular sports services dataset of demographic frequency and service level agreement"
## [6] "aspset an outdoor sports pose video dataset with d keypoint annotations"
## [7] "dataset for the analysis of tv viewer response to live sport broadcasts and sponsor messages"
## [8] "sports work strategy of college counselors based on mysql database big data analysis"
## [9] "epidemiology of testicular trauma in sports analysis of the national electronic injury surveillance system database"
## [10] "administrative databases used for sports medicine research demonstrate significant differences in underlying patient demographics and resulting surgical trends"
## [11] "analysis of research trends on elbow pain in overhead sports a bibliometric study based on web of science database and vosviewer"
## [12] "the racial and sexual differences in emergency department visits for sportrelated spine fracture injuries a neiss database study"
## [13] "comprehensive dataset on presarscov infection sportsrelated physical activity levels disease severity and treatment outcomes insights and implications for covid management"
## [14] "analysis of a comprehensive dataset influence of vaccination profile types and severe acute respiratory syndrome coronavirus reinfections on changes in sportsrelated physical activity one month after infection"
# Remove duplicated titles, keeping the first unique entry
bibliography <- bibliography[!duplicated(bibliography$titleLower), ]
# Check that duplicates are gone
any(duplicated(bibliography$titleLower))
## [1] FALSE
## [1] 278 104
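The title normalisation and de-duplication steps above can be checked on a toy vector. This is a minimal sketch using only base R: `gsub()` stands in for the `strip()` call, the `normalise()` helper is a hypothetical name, and the input titles are illustrative.

```r
# Illustrative titles: the first two differ only in case and punctuation.
titles <- c("Match Score Dataset for Team Ball Sports",
            "Match score dataset for team ball sports.",
            "ASPset: An Outdoor Sports Pose Video Dataset")

# Lower-case, drop everything except letters and spaces, squeeze whitespace.
# gsub() is a base-R stand-in for strip(); normalise() is a hypothetical helper.
normalise <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)
  trimws(gsub(" +", " ", x))
}

key <- normalise(titles)

# duplicated() flags the second and later copies, so the first entry is kept.
unique_titles <- titles[!duplicated(key)]
unique_titles
```

Matching on the normalised key rather than the raw title is what lets records exported from different databases collapse into one entry.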
The dataset is then filtered to keep only the selected articles, reducing
the number of records from 278 to 89.
bibliographyRev <- read.csv("data/database/bibliography/bibliographyRev.csv")
bibliographyRev <- bibliographyRev %>%
  filter(screened_abstracts == "selected") %>%
  dplyr::select(author, title, year, keywords, abstract, doi, titlelower,
                filename)
# write.csv(bibliographyRev, "data/database/bibliography/bibliographyRevSelected.csv",
#           row.names = FALSE)
dim(bibliographyRev)
## [1] 89 8
## [1] "author" "title" "year" "keywords" "abstract"
## [6] "doi" "titlelower" "filename"
From the output file above, an Excel file was created manually to
categorise the databases into Articles
(sheet = databaseAR), Packages (R and Python)
(sheet = databasePA), and Kaggle
(sheet = databaseOT).
Articles: Databases were searched using the keywords “sport” AND “database” or “sport” AND “dataset” for publicly available datasets.
Packages: Actively maintained packages containing databases related to athletes were included.
Kaggle: In the datasets category,
the keywords used were “injuries”, “sport”,
“NFL”, and “AFL”. In the
competitions category, only “sport” was used.
For both categories, only the top 10 datasets were
reviewed.
## [1] "bibliography" "databaseAR" "databasePA" "databaseOT" "database"
## [6] "rank" "summary"
The database sheet contains the merged data from all
files, and the summary sheet will be used to generate
insights and visualisations.
# Read the summary sheet
summary <- read_excel("data/database/bibliography/bibliographyRevSelected.xlsx",
sheet = "summary")
colnames(summary)
## [1] "column" "study.title" "dataset.name"
## [4] "dataset" "dataset.type" "methods.model"
## [7] "use.case.type" "use.case" "aim.dataset"
## [10] "valid.data" "total.score" "synthetic.generation"
## [13] "country" "year.start" "year.end"
## [16] "year.range" "population.age.range" "population.type"
## [19] "population.sex" "population" "sample.overall"
## [22] "sample.raw" "sample.size" "study.design"
## [25] "sport.type" "sports.covered" "data.type"
## [28] "variables.collected" "literature.category"
colorPalette <- RColorBrewer::brewer.pal(8, "Set2")
f1 <- plot_ly(summary,
x = ~population.type, y = ~sample.overall,
type = 'scatter', mode = 'markers',
color = ~population.type, colors = colorPalette,
size = ~sample.overall, sizes = c(10, 60),
marker = list(opacity = 0.7, line = list(width = 1, color = '#333')),
hoverinfo = 'text',
text = ~paste('Dataset:', dataset.name,
'<br>Samples:', sample.overall,
'<br>Population:', population.type),
showlegend = FALSE)
f2 <- plot_ly(summary %>% count(sport.type),
x = ~sport.type, y = ~n, type = 'bar',
color = ~sport.type, colors = colorPalette,
showlegend = FALSE)
f3 <- plot_ly(summary %>% count(data.type),
x = ~n, y = ~reorder(data.type, n),
type = 'bar', orientation = 'h',
color = ~data.type, colors = colorPalette,
showlegend = FALSE)
f4 <- plot_ly(summary,
x = ~data.type, y = ~sample.overall,
type = 'scatter', mode = 'markers',
color = ~valid.data, colors = c('#E15759', '#59A14F'),
size = ~sample.overall, sizes = c(10, 50),
marker = list(opacity = 0.7),
hoverinfo = 'text',
text = ~paste('Dataset:', dataset.name,
'<br>Type:', data.type,
'<br>Valid:', valid.data,
'<br>Samples:', sample.overall))
fig <- subplot(f1, f2, f3, f4, nrows = 2, margin = 0.20) %>%
layout(
plot_bgcolor = "rgba(0,0,0,0)",
paper_bgcolor = "rgba(0,0,0,0)",
showlegend = TRUE,
legend = list(orientation = "h", x = 0.55, y = -0.15),
annotations = list(
list(text = "Sample Size by Population Type",
x = 0.20, y = 1.05, showarrow = FALSE,
xref='paper', yref='paper', font=list(size=14)),
list(text = "Datasets by Sport Type",
x = 0.80, y = 1.05, showarrow = FALSE,
xref='paper', yref='paper', font=list(size=14)),
list(text = "Data Type Distribution", x = 0.20, y = 0.47,
showarrow = FALSE, xref='paper', yref='paper', font=list(size=14)),
list(text = "Sample Size vs Data Type (by Validity)",
x = 0.80, y = 0.47, showarrow = FALSE, xref='paper',
yref='paper', font=list(size=14))
)
)
fig
# Create one row per dataset (column) and country.
# Datasets with multiple countries will have multiple rows
summaryMap <- summary %>%
  mutate(country = str_split(country, ",")) %>%
  unnest(country) %>%
  mutate(country = str_trim(country))
summaryMap
# Generate the information to display in the map
countrySummary <- summaryMap %>%
group_by(country) %>%
summarise(
nDatasets = n(),
datasets = paste(unique(column), collapse = "; "),
studyDesigns = paste(unique(study.design), collapse = "; "),
sampleRange = paste0("Min: ", min(sample.overall, na.rm = TRUE),
" | Max: ", max(sample.overall, na.rm = TRUE)),
population = paste(unique(population.type), collapse = "; "),
sex = paste(unique(population.sex), collapse = "; "),
sports = paste(unique(sport.type), collapse = "; "),
reference = paste(unique(dataset), collapse = "; ")
)
countrySummary
# Create hover text with the information above
countrySummary <- countrySummary %>%
mutate(hoverText = paste0(
"<b>", country, "</b><br>",
"Datasets: ", nDatasets, "<br>",
"Study Design: ", studyDesigns, "<br>",
"Sample Range: ", sampleRange, "<br>",
"Population: ", population, "<br>",
"Sex: ", sex, "<br>",
"Sports: ", sports, "<br>",
"Dataset Names: ", datasets, "<br>",
"Reference: ", reference
))
countrySummary
The following map does not display the International (n = 9) and
Commonwealth countries (n = 1) datasets.
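The per-country rows behind the map come from splitting comma-separated country strings. A minimal base-R sketch of that reshaping follows, assuming a toy table with hypothetical dataset names and countries; `strsplit()` and `rep()` stand in for the `str_split()` and `unnest()` calls used in the report.

```r
# Toy table: one row per dataset, countries comma-separated (hypothetical values).
toy <- data.frame(dataset = c("D1", "D2", "D3"),
                  country = c("Australia, Norway", "Norway", "USA"),
                  stringsAsFactors = FALSE)

# Split each country string and repeat the dataset name once per country.
parts <- strsplit(toy$country, ",")
long <- data.frame(dataset = rep(toy$dataset, lengths(parts)),
                   country = trimws(unlist(parts)),
                   stringsAsFactors = FALSE)

# Count datasets per country, as the map summary does with n().
n_datasets <- table(long$country)
n_datasets
```

Trimming whitespace after the split matters: without it, "Norway" and " Norway" would be counted as two different countries.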
Additionally, the plot allows us to visualise the different types of datasets:
Article: Data validated and used in a paper.
Package: Dataset can be extracted from a CRAN or Python package.
Kaggle: Available from the website Kaggle.com.
# Prepare the dataset selecting the relevant columns
variables <- summary %>%
select(column, sport.type, data.type, variables.collected, dataset) %>%
mutate(
sourceType = case_when(
str_detect(dataset, regex("Kaggle", ignore_case = TRUE)) ~ "Kaggle",
str_detect(dataset, regex("CRAN|Python", ignore_case = TRUE)) ~ "Package",
TRUE ~ "Article"
),
variables.collected = str_replace_all(
variables.collected,
regex("(\\d+\\.)\\s*", ignore_case = TRUE),
"<br>• "
),
variables.collected = paste0("<b>Variables:</b>", variables.collected)
)
variables
The following plot links three sections: datasets, sports, and data types (labelled Variables in the plot).
Each flow represents a connection between these sections and is
coloured by its data source type
(Kaggle-Orange, Package-Green, or Article-Blue). Move the
mouse over a flow to see the variables included in that
connection.
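A Sankey trace in plotly refers to nodes by zero-based position, which is why the code below subtracts 1 after `match()`. The following base-R sketch shows that mapping on a toy table; the rows and the `index_of()` helper are hypothetical.

```r
# Toy rows: dataset -> sport -> data type (hypothetical values).
toy <- data.frame(dataset = c("D1", "D2"),
                  sport   = c("Football", "Football"),
                  type    = c("Video", "Tabular"),
                  stringsAsFactors = FALSE)

# Every distinct label becomes one node.
nodes <- unique(c(toy$dataset, toy$sport, toy$type))

# Links point at nodes by zero-based position, hence the "- 1".
index_of <- function(x) match(x, nodes) - 1

# Two link sets: dataset -> sport, then sport -> data type.
links <- rbind(
  data.frame(source = index_of(toy$dataset), target = index_of(toy$sport)),
  data.frame(source = index_of(toy$sport),   target = index_of(toy$type))
)
links
```

Because nodes are identified by label alone, a label shared between two sections (for example a sport and a data type with the same name) would collapse into a single node.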
# Create Node List
nodes <- data.frame(
name = unique(c(variables$column, variables$sport.type, variables$data.type))
)
# Function to map each label to numeric index
get_index <- function(x) match(x, nodes$name) - 1
# Links
links <- bind_rows(
variables %>%
transmute(
source = get_index(column),
target = get_index(sport.type),
type = sourceType,
hover = variables.collected
),
variables %>%
transmute(
source = get_index(sport.type),
target = get_index(data.type),
type = sourceType,
hover = variables.collected
)
)
color_map <- c(
"Kaggle" = "#FFB347",
"Package" = "#77DD77",
"Article" = "#779ECB"
)
links$color <- color_map[links$type]
# Plotly Sankey
fig <- plot_ly(
type = "sankey",
arrangement = "snap",
node = list(
label = nodes$name,
color = "grey",
pad = 15,
thickness = 20,
line = list(color = "black", width = 0.5)
),
link = list(
source = links$source,
target = links$target,
value = rep(1, nrow(links)),
color = links$color,
customdata = links$hover,
hovertemplate = "%{customdata}<extra></extra>"
))
fig <- fig %>%
layout(
title = list(
text = "Variables Across Sports Datasets",
font = list(size = 18, color = "#333", family = "Roboto")
),
font = list(size = 12),
margin = list(l = 10, r = 10, t = 60, b = 10),
annotations = list(
list(
x = 0.00, y = 1.05,
text = "<b>Datasets</b>",
showarrow = FALSE,
xref = "paper", yref = "paper",
font = list(size = 14, color = "#FFB347", family = "Roboto")
),
list(
x = 0.50, y = 0.76,
text = "<b>Sports</b>",
showarrow = FALSE,
xref = "paper", yref = "paper",
font = list(size = 14, color = "#77DD77", family = "Roboto")
),
list(
x = 0.95, y = 0.75,
text = "<b>Variables</b>",
showarrow = FALSE,
xref = "paper", yref = "paper",
font = list(size = 14, color = "#779ECB", family = "Roboto")
)
)
)
fig
The datasets were then analysed and scored manually based on the following table:
We added the rank sheet to the main file to store
the scores. Two new columns were created manually:
TotalScore, the score assigned to each dataset,
and literatureCategory, the category assigned
by the literature review analysis (GAN-based or
Statistical). They appear as total.score and literature.category in the summary sheet.
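As a rough illustration of how a total score and a per-category ordering could be derived, the sketch below sums eight criterion marks per dataset. The criterion names (`c1` to `c8`), the 0/1 scale, and all values are assumptions made for the example; the actual criteria and scale are those of the scoring table.

```r
# Hypothetical 0/1 marks for eight placeholder criteria (c1..c8).
marks <- data.frame(
  dataset  = c("D1", "D2", "D3"),
  category = c("Statistical", "GAN-based", "Statistical"),
  matrix(c(1, 1, 1, 0, 1, 1, 0, 1,
           1, 0, 1, 1, 0, 0, 1, 0,
           1, 1, 1, 1, 1, 1, 1, 0),
         nrow = 3, byrow = TRUE,
         dimnames = list(NULL, paste0("c", 1:8))),
  stringsAsFactors = FALSE
)

# Total score = sum of the criterion marks for each dataset.
marks$total <- rowSums(marks[paste0("c", 1:8)])

# Order by category, then by descending total, like arrange(group, desc(value)).
ranked <- marks[order(marks$category, -marks$total),
                c("dataset", "category", "total")]
ranked
```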
# Select the variable of interest
summaryScore <- summary %>%
select(column, total.score, literature.category, valid.data,
population.type, sport.type, data.type)
summaryScore
# Rename the columns
summaryScore <- summaryScore %>%
  rename(dataset = column,
         group = literature.category,
         value = total.score) %>%
  mutate(group = as.factor(group)) %>%
  arrange(group, desc(value))
# Create two data frames for separate plots. Each plot shows the dataset info on hover
statsD <- filter(summaryScore, group == "Statistical")
ganD <- filter(summaryScore, group == "GAN-based")
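# One colour per data type. brewer.pal() accepts 3 to 8 colours for "Set2",
# so colorRampPalette() interpolates when there are more data types
dataTypes <- unique(summaryScore$data.type)
colorPalette <- setNames(
  colorRampPalette(RColorBrewer::brewer.pal(min(max(length(dataTypes), 3), 8),
                                            "Set2"))(length(dataTypes)),
  dataTypes)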
stat <- plot_ly(statsD,
x = ~value,
y = ~reorder(dataset, value),
type = 'bar',
orientation = 'h',
color = ~data.type,
colors = colorPalette,
hoverinfo = 'text',
marker = list(line = list(width = 1.5)),
text = ~paste(
"<b>Dataset:</b>", dataset,
"<br><b>Value:</b>", round(value, 3),
"<br><b>valid.data:</b>", valid.data,
"<br><b>Population:</b>", population.type,
"<br><b>Sport:</b>", sport.type,
"<br><b>Data Type:</b>", data.type
))
gan <- plot_ly(ganD,
x = ~value,
y = ~reorder(dataset, value),
type = 'bar',
orientation = 'h',
color = ~data.type,
colors = colorPalette,
hoverinfo = 'text',
marker = list(line = list(width = 1.5)),
text = ~paste(
"<b>Dataset:</b>", dataset,
"<br><b>Value:</b>", round(value, 3),
"<br><b>valid.data:</b>", valid.data,
"<br><b>Population:</b>", population.type,
"<br><b>Sport:</b>", sport.type,
"<br><b>Data Type:</b>", data.type
))
legend <- data.frame(
data.type = unique(summaryScore$data.type),
color = unname(colorPalette[unique(summaryScore$data.type)])
)
legendM <- plot_ly()
for(i in seq_len(nrow(legend))) {
legendM <- legendM %>%
add_trace(
type = "scatter",
mode = "markers+text",
x = 1, y = i,
marker = list(size = 14, color = legend$color[i]),
text = legend$data.type[i],
textposition = "right",
hoverinfo = "none",
showlegend = FALSE
) %>%
layout( title = "Data Type",
xaxis = list(
visible = FALSE,
zeroline = FALSE,
showgrid = FALSE,
showticklabels = FALSE
),
yaxis = list(
visible = FALSE,
zeroline = FALSE,
showgrid = FALSE,
showticklabels = FALSE
)
)
}
p <- subplot(
  subplot(stat, gan, nrows = 2, shareX = TRUE, titleY = TRUE), legendM,
  widths = c(0.70, 0.30)) %>%
  layout(title = "Ranking of datasets by Approach (Statistical vs GAN-based Approaches)",
         showlegend = FALSE,
         yaxis = list(title = "GAN-based", automargin = TRUE),
         yaxis2 = list(title = "Statistical", automargin = TRUE))
p
# Wrap text to multiple lines for readability
wrap_text <- function(x, width = 25) str_wrap(x, width = width)
summary <- summary %>%
mutate(
methods.model = wrap_text(methods.model, 100),
use.case = wrap_text(use.case, 100),
aim.dataset = wrap_text(aim.dataset, 200),
variables.collected = wrap_text(variables.collected, 200),
country = wrap_text(country, 100),
year.range = wrap_text(year.range, 100),
population = wrap_text(population, 100),
sample.size = wrap_text(sample.size, 100),
sports.covered = wrap_text(sports.covered, 100),
data.type = wrap_text(data.type, 100),
study.design = wrap_text(study.design, 100),
literature.category = wrap_text(literature.category, 100)
)
summary
# Level 1: dataset.type
lvl1 <- summary %>%
distinct(dataset.type) %>%
mutate(
ids = dataset.type,
labels = dataset.type,
parents = ""
)
# Level 2: use.case.type
lvl2 <- summary %>%
distinct(dataset.type, use.case.type) %>%
mutate(
ids = paste(dataset.type, use.case.type, sep = "-"),
labels = use.case.type,
parents = dataset.type
)
# Level 3: sport.type
lvl3 <- summary %>%
distinct(dataset.type, use.case.type, sport.type) %>%
mutate(
ids = paste(dataset.type, use.case.type, sport.type, sep = "-"),
labels = sport.type,
parents = paste(dataset.type, use.case.type, sep = "-")
)
# Level 4: dataset
lvl4 <- summary %>%
distinct(
dataset.type, use.case.type, sport.type, dataset,
dataset.name, methods.model, use.case,
aim.dataset, variables.collected, country, year.range,
population, sample.size, sports.covered, data.type,
study.design, literature.category
) %>%
mutate(
aim.dataset = str_replace_all(
aim.dataset,
regex("\\b(\\d+\\.)", ignore_case = TRUE),
"<br>\\1 "
),
variables.collected = str_replace_all(
variables.collected,
regex("\\b(\\d+\\.)", ignore_case = TRUE),
"<br>\\1 "
),
ids = paste(dataset.type, use.case.type, sport.type, dataset, sep = "-"),
labels = paste0(
"<b>Dataset:</b> ", dataset, "<br>",
"<b>Name:</b> ", dataset.name, "<br>",
"<b>Method:</b> ", methods.model, "<br>",
"<b>Use Case:</b> ", use.case, "<br>",
"<b>Aim:</b> ", aim.dataset, "<br>",
"<b>Variables:</b> ", variables.collected, "<br>",
"<b>Country:</b> ", country, "<br>",
"<b>Year Range:</b> ", year.range, "<br>",
"<b>Population:</b> ", population, "<br>",
"<b>Sample Size:</b> ", sample.size, "<br>",
"<b>Sports Covered:</b> ", sports.covered, "<br>",
"<b>Data Type:</b> ", data.type, "<br>",
"<b>Study Design:</b> ", study.design, "<br>",
"<b>Category:</b> ", literature.category
),
parents = paste(dataset.type, use.case.type, sport.type, sep = "-")
)
# Stack the four levels; .id records which level (1 to 4) each row came from
treeD <- bind_rows(
  lvl1,
  lvl2,
  lvl3,
  lvl4,
  .id = "level"
)
treeD
# Insert colours
levelColours <- c(
  "#E69F00", # Level 1 (dataset.type)
  "#009E73", # Level 2 (use.case.type)
  "#0072B2", # Level 3 (sport.type)
  "#000000" # Level 4 (dataset)
)
# Map each row's level to its colour. Taking the level from bind_rows()
# avoids parsing ids, which would misfire if a label itself contains "-"
treeD <- treeD %>%
  mutate(
    level = as.integer(level),
    colors = levelColours[level]
  )
plot_ly(
treeD,
type = "treemap",
ids = ~ids,
labels = ~labels,
parents = ~parents,
marker = list(colors = ~colors),
textinfo = "label+children"
)
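A treemap built from ids and parents only renders correctly when every id is unique and every non-root parent matches an existing id. The sketch below runs both checks on a toy hierarchy in the same ids/labels/parents layout; the values are hypothetical, and the same two checks could be run on `treeD` before plotting.

```r
# Toy hierarchy in the ids/labels/parents layout (hypothetical values).
tree <- data.frame(
  ids     = c("Article", "Article-Injury", "Article-Injury-Football"),
  labels  = c("Article", "Injury", "Football"),
  parents = c("", "Article", "Article-Injury"),
  stringsAsFactors = FALSE
)

# Check 1: no id may be repeated.
ids_unique <- !any(duplicated(tree$ids))

# Check 2: every non-root parent must itself appear among the ids.
orphans <- setdiff(tree$parents[tree$parents != ""], tree$ids)

ids_unique
length(orphans)
```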